Use mozinor for classification

Import the main module


In [1]:
from mozinor.baboulinet import Baboulinet


/home/jwuthri/anaconda3/lib/python3.6/site-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)

Prepare the pipeline

Baboulinet takes the following parameters (a fuller call using them is sketched after this list):

(str) filepath: path to the csv file
(str) y_col: the column to predict
(bool) regression: regression or classification?
(bool) process: (WARNING) apply some preprocessing to your data (tune this preprocessing with the params below)
(char) sep: delimiter
(list) col_to_drop: columns you do not want to use in the prediction
(bool) derivate: for every combination of features, derive new features (n1 * n2, n1 / n2, ...)
(bool) transform: for every feature, apply log(n), sqrt(n), square(n)
(bool) scaled: scale the data?
(bool) infer_datetime: check the type of every column and, if it is a date, build new columns from it (day, month, year, time)
(str) encoding: data encoding
(bool) dummify: create dummies for your categorical variables
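
For reference, a fuller call using these options might look like the sketch below; the values are illustrative choices, not mozinor's defaults.

# Sketch only (not an executed cell) -- the values below are illustrative, not mozinor's defaults:
from mozinor.baboulinet import Baboulinet

cls = Baboulinet(
    filepath="toto.csv",    # csv file to read
    y_col="predict",        # column to predict
    regression=False,       # classification task
    process=True,           # apply the preprocessing tuned by the params below
    sep=",",                # csv delimiter
    col_to_drop=[],         # columns to leave out of the prediction
    derivate=False,         # pairwise feature combinations (n1 * n2, n1 / n2, ...)
    transform=False,        # per-feature log / sqrt / square transforms
    scaled=True,            # scale the data
    infer_datetime=False,   # derive day/month/year/time columns from date columns
    encoding="utf-8",       # data encoding
    dummify=True,           # dummies for categorical variables
)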

The data files have been generated with sklearn.datasets.make_classification
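
For example, a file with the shape used here (10,000 rows, feature columns a through h, and a predict target, matching the (10000, 9) shape logged below) can be built roughly as follows; the exact generation parameters of toto.csv are not known.

# Sketch: build a toto.csv-like file; the exact parameters used for the
# original file are not known.
import pandas as pd
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=10000, n_features=8, random_state=0)
df = pd.DataFrame(X, columns=list("abcdefgh"))
df["predict"] = y
df.to_csv("toto.csv", index=False)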


In [2]:
cls = Baboulinet(filepath="toto.csv", y_col="predict", regression=False)

Now run the pipeline

This may take some time

In [3]:
res = cls.babouline()


Reading the file toto.csv
Read csv file: toto.csv
args: {'encoding': 'utf-8-sig', 'sep': ',', 'decimal': ',', 'engine': 'python', 'filepath_or_buffer': 'toto.csv', 'thousands': '.', 'parse_dates': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'predict'], 'infer_datetime_format': True}
Inital dtypes is a          float64
b          float64
c          float64
d          float64
e          float64
f          float64
g          float64
h          float64
predict      int64
dtype: object
Work on PolynomialFeatures: degree 1
Optimal number of clusters
(10000, 9)

    Polynomial Features: generate a new feature matrix
    consisting of all polynomial combinations of the features.
    For 2 features [a, b]:
        the degree 1 polynomial gives [a, b]
        the degree 2 polynomial gives [1, a, b, a^2, ab, b^2]
    ...


    ELBOW: explain the variance as a function of clusters.

Optimal number of trees
    OOB: this is the average error for each training observation,
    calculated using the trees that do not contain this observation
    during the creation of the tree.

Estimator ExtraTreesClassifier
    ExtraTreesClassifier: as in random forests, a random subset of candidate
    features is used, but instead of looking for the most discriminative
    thresholds, thresholds are drawn at random for each candidate feature and
    the best of these randomly-generated thresholds is picked as
    the splitting rule.

Fitting 3 folds for each of 10 candidates, totalling 30 fits
[Parallel(n_jobs=1)]: Done  30 out of  30 | elapsed:    9.1s finished
   Best params => {'n_estimators': 100, 'min_samples_split': 4, 'min_samples_leaf': 1, 'max_features': 0.6, 'criterion': 'entropy', 'bootstrap': False}
   Best Score => 0.865
Estimator XGBClassifier
    Gradient boosting is an approach where new models are created that predict
    the residuals or errors of prior models and then added together to make
    the final prediction. It is called gradient boosting because it uses a
    gradient descent algorithm to minimize the loss when adding new models.

Fitting 3 folds for each of 10 candidates, totalling 30 fits
[Parallel(n_jobs=1)]: Done  30 out of  30 | elapsed:   38.4s finished
   Best params => {'subsample': 0.9, 'n_estimators': 50, 'min_child_weight': 6, 'max_depth': 8, 'learning_rate': 0.5}
   Best Score => 0.855
Estimator KNeighborsClassifier
    KNeighborsClassifier: Majority vote of its k nearest neighbors.

Fitting 3 folds for each of 10 candidates, totalling 30 fits
Fitting 3 folds for each of 4 candidates, totalling 12 fits
[Parallel(n_jobs=1)]: Done  12 out of  12 | elapsed:    4.1s finished
   Best params => {'n_neighbors': 17, 'p': 2, 'weights': 'distance'}
   Best Score => 0.853
Estimator DecisionTreeClassifier
    Decision Tree Classifier: poses a series of carefully crafted questions
    about the attributes of the test record. Each time it receives an answer,
    a follow-up question is asked until a conclusion about the class label
    of the record is reached.

Fitting 3 folds for each of 10 candidates, totalling 30 fits
[Parallel(n_jobs=1)]: Done  30 out of  30 | elapsed:    2.0s finished
   Best params => {'min_samples_split': 5, 'min_samples_leaf': 2, 'max_depth': 10, 'criterion': 'entropy'}
   Best Score => 0.750
Check the decision tree: 2017-08-1813:13:19.847449.png
Work on PolynomialFeatures: degree 2
Optimal number of clusters
dot: graph is too large for cairo-renderer bitmaps. Scaling by 0.880171 to fit


    Polynomial Features: generate a new feature matrix
    consisting of all polynomial combinations of the features.
    For 2 features [a, b]:
        the degree 1 polynomial gives [a, b]
        the degree 2 polynomial gives [1, a, b, a^2, ab, b^2]
    ...


    ELBOW: explain the variance as a function of clusters.

Optimal number of trees
    OOB: this is the average error for each training observation,
    calculated using the trees that do not contain this observation
    during the creation of the tree.

Estimator ExtraTreesClassifier
    ExtraTreesClassifier: as in random forests, a random subset of candidate
    features is used, but instead of looking for the most discriminative
    thresholds, thresholds are drawn at random for each candidate feature and
    the best of these randomly-generated thresholds is picked as
    the splitting rule.

Fitting 3 folds for each of 10 candidates, totalling 30 fits
[Parallel(n_jobs=1)]: Done  30 out of  30 | elapsed:   17.0s finished
   Best params => {'n_estimators': 50, 'min_samples_split': 3, 'min_samples_leaf': 1, 'max_features': 0.1, 'criterion': 'gini', 'bootstrap': False}
   Best Score => 0.857
Estimator XGBClassifier
    Gradient boosting is an approach where new models are created that predict
    the residuals or errors of prior models and then added together to make
    the final prediction. It is called gradient boosting because it uses a
    gradient descent algorithm to minimize the loss when adding new models.

Fitting 3 folds for each of 10 candidates, totalling 30 fits
[Parallel(n_jobs=1)]: Done  30 out of  30 | elapsed:  1.3min finished
   Best params => {'subsample': 0.9, 'n_estimators': 100, 'min_child_weight': 7, 'max_depth': 4, 'learning_rate': 0.5}
   Best Score => 0.857
Estimator KNeighborsClassifier
    KNeighborsClassifier: Majority vote of its k nearest neighbors.

Fitting 3 folds for each of 10 candidates, totalling 30 fits
Fitting 3 folds for each of 4 candidates, totalling 12 fits
[Parallel(n_jobs=1)]: Done  12 out of  12 | elapsed:   36.3s finished
   Best params => {'n_neighbors': 11, 'p': 2, 'weights': 'distance'}
   Best Score => 0.853
Estimator DecisionTreeClassifier
    Decision Tree Classifier: poses a series of carefully crafted questions
    about the attributes of the test record. Each time it receives an answer,
    a follow-up question is asked until a conclusion about the class label
    of the record is reached.

Fitting 3 folds for each of 10 candidates, totalling 30 fits
[Parallel(n_jobs=1)]: Done  30 out of  30 | elapsed:    7.5s finished
   Best params => {'min_samples_split': 7, 'min_samples_leaf': 8, 'max_depth': 6, 'criterion': 'gini'}
   Best Score => 0.738
Check the decision tree: 2017-08-1813:18:56.364832.png
                                           Estimator     Score  Degree
0  (ExtraTreeClassifier(class_weight=None, criter...  0.864667       1
1  XGBClassifier(base_score=0.5, colsample_byleve...  0.856800       2
2  (ExtraTreeClassifier(class_weight=None, criter...  0.856667       2
3  XGBClassifier(base_score=0.5, colsample_byleve...  0.855333       1
4  KNeighborsClassifier(algorithm='auto', leaf_si...  0.853333       1
5  KNeighborsClassifier(algorithm='auto', leaf_si...  0.852933       2
6  DecisionTreeClassifier(class_weight=None, crit...  0.750400       1
7  DecisionTreeClassifier(class_weight=None, crit...  0.737867       2
    Stacking: is a model ensembling technique used to combine information
    from multiple predictive models to generate a new model.

task:   [classification]
metric: [accuracy_score]

model 0: [ExtraTreesClassifier]
    ----
    MEAN:   [0.86173333]

model 1: [XGBClassifier]
    ----
    MEAN:   [0.84853333]

model 2: [KNeighborsClassifier]
    ----
    MEAN:   [0.86053333]

model 3: [DecisionTreeClassifier]
    ----
    MEAN:   [0.75666667]

Stacking 4 models: 100%|██████████| 15/15 [00:21<00:00,  1.63s/it]
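
The stacking step above trains a second-level model on the out-of-fold predictions of the tuned first-level models. Mozinor does this internally; the idea can be sketched with plain scikit-learn (illustrative only, not mozinor's implementation), reusing some of the degree-1 parameters reported above.

# Conceptual sketch of stacking with scikit-learn -- not mozinor's internal code.
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("toto.csv")
X, y = df.drop(columns=["predict"]), df["predict"]

# First-level models (hyperparameters echo the tuned degree-1 results above).
first_level = [ExtraTreesClassifier(n_estimators=100, criterion="entropy"),
               KNeighborsClassifier(n_neighbors=17, weights="distance")]

# Their out-of-fold predictions become the features of the second-level model.
meta_features = np.column_stack([
    cross_val_predict(model, X, y, cv=3) for model in first_level
])

second_level = DecisionTreeClassifier(max_depth=10)
second_level.fit(meta_features, y)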

The class instance now contains two objects: the best model for this data and the best stacking for this data

To auto-generate the code of the models

Generate the code for the best model


In [4]:
cls.bestModelScript()


Check script file toto_solo_model_script.py
Out[4]:
'toto_solo_model_script.py'

Generate the code for the best stacking


In [5]:
cls.bestStackModelScript()


Check script file toto_stack_model_script.py
Out[5]:
'toto_stack_model_script.py'

To check which model is the best

Best model


In [6]:
res.best_model


Out[6]:
Estimator    (ExtraTreeClassifier(class_weight=None, criter...
Score                                                 0.864667
Degree                                                       1
Name: 0, dtype: object

In [7]:
show = """
    Model: {},
    Score: {}
"""
print(show.format(res.best_model["Estimator"], res.best_model["Score"]))


    Model: ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='entropy',
           max_depth=None, max_features=0.6, max_leaf_nodes=None,
           min_impurity_split=1e-07, min_samples_leaf=1,
           min_samples_split=4, min_weight_fraction_leaf=0.0,
           n_estimators=100, n_jobs=1, oob_score=False, random_state=None,
           verbose=0, warm_start=False),
    Score: 0.8646666666666667

Best stacking


In [8]:
res.best_stack_models


Out[8]:
Fit1stLevelEstimator    [(ExtraTreeClassifier(class_weight=None, crite...
Fit2ndLevelEstimator    DecisionTreeClassifier(class_weight=None, crit...
Score                                                              0.8736
Degree                                                                  1
Name: 0, dtype: object

In [9]:
show = """
    FirstModel: {},
    SecondModel: {},
    Score: {}
"""
print(show.format(res.best_stack_models["Fit1stLevelEstimator"], res.best_stack_models["Fit2ndLevelEstimator"], res.best_stack_models["Score"]))


    FirstModel: [ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='entropy',
           max_depth=None, max_features=0.6, max_leaf_nodes=None,
           min_impurity_split=1e-07, min_samples_leaf=1,
           min_samples_split=4, min_weight_fraction_leaf=0.0,
           n_estimators=100, n_jobs=1, oob_score=False, random_state=None,
           verbose=0, warm_start=False), XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=1,
       gamma=0, learning_rate=0.5, max_delta_step=0, max_depth=8,
       min_child_weight=6, missing=None, n_estimators=50, nthread=-1,
       objective='multi:softprob', reg_alpha=0, reg_lambda=1,
       scale_pos_weight=1, seed=0, silent=True, subsample=0.9), KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=17, p=2,
           weights='distance')],
    SecondModel: DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=10,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=2,
            min_samples_split=5, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best'),
    Score: 0.8736
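
To reuse the winning configuration outside of mozinor, the generated scripts above are the intended route. Assuming the stored Estimator is a regular scikit-learn estimator (as its repr suggests), it can also be refit by hand; note that this sketch skips mozinor's own preprocessing (polynomial features, scaling, ...).

# Sketch: refit the best single-model configuration on the raw csv.
# Assumes res.best_model["Estimator"] is a scikit-learn estimator and skips
# mozinor's preprocessing, so scores may differ from those reported above.
import pandas as pd
from sklearn.base import clone

df = pd.read_csv("toto.csv")
X, y = df.drop(columns=["predict"]), df["predict"]

best = clone(res.best_model["Estimator"])   # fresh, unfitted copy with the tuned params
best.fit(X, y)
predictions = best.predict(X)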